Skip to content

[UX][k8s] show-gpus for all allowed contexts #5362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 29, 2025

Conversation

kyuds
Copy link
Collaborator

@kyuds kyuds commented Apr 25, 2025

Fixes #5313

Screenshot 2025-04-27 at 2 09 04 PM

IMPORTANT! Feedback Please
I didn't know whether it would be desirable, but I think it would be a good idea to show a final table that would merge all available accelerators across all k8s clusters? The user might want to only know the total availability across clusters. I think that might be a helpful summary feature. For instance, for the above image, we would have:

Kubernetes GPUs (all allowed contexts):
GPU: T4, TOTAL_GPUS: 2, TOTAL_FREE_GPUS: 2

EDIT (25.04.27): implemented above feature as suggested in below comment

Some Behavioral Changes:

  • specifying an accelerator and the kubernetes context (aka region) at the same time ignored the kubernetes context. Fixed that to take into account the k8s context.
  • in kubernetes_catalog. _list_accelerators(), the function was adding accelerator names to the dict and setting its availability to 0. I don't know if this is a real requirement (feedback would be great), but this creates an assertion error when checking the keyset sizes for accelerator counts, capacity, and available dicts when used in conjunction with a quantity filter. Therefore, added in checks to NOT add accelerator names to the dict when capacity is 0.
  • added a "total table" feature so that users can see a cumulative sum of GPUs across their registered k8s clusters in allowed contexts.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kyuds kyuds marked this pull request as draft April 25, 2025 02:45
@kyuds kyuds marked this pull request as ready for review April 25, 2025 04:33
@SeungjinYang
Copy link
Collaborator

Re: UI, I do agree on having a table showing aggregated GPU availability across all clusters.

I actually think such table should be at the top, because the current UI places per-context GPU availability above node level availability, so it already has a flow of information going from more general -> more specific from top to bottom.

@kyuds
Copy link
Collaborator Author

kyuds commented Apr 27, 2025

Re: UI, I do agree on having a table showing aggregated GPU availability across all clusters.

I actually think such table should be at the top, because the current UI places per-context GPU availability above node level availability, so it already has a flow of information going from more general -> more specific from top to bottom.

implemented!

@kyuds kyuds requested a review from SeungjinYang April 29, 2025 00:16
Co-authored-by: Seung Jin <seungjin219@gmail.com>
@kyuds kyuds requested a review from SeungjinYang April 29, 2025 00:29
Copy link
Collaborator

@SeungjinYang SeungjinYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kyuds!

@SeungjinYang SeungjinYang merged commit 5fb3933 into skypilot-org:master Apr 29, 2025
22 checks passed
@kyuds kyuds deleted the k8s/gpus branch April 29, 2025 01:17
@Michaelvll
Copy link
Collaborator

Michaelvll commented May 2, 2025

I am trying the latest master with this PR in. The UX for a single k8s with GPUs seems a bit weird to me:
image

  1. Should we avoid the \n\n at the end?
  2. Should we avoid the --- at the end of the table?
  3. The color for Kubernetes per node accelerator availability does not align with the color scheme in our other UX. See the original UX below:
image

@concretevitamin
Copy link
Member

Minor to above: would be good to replace None with -.

@SeungjinYang
Copy link
Collaborator

SeungjinYang commented May 3, 2025

I actually think that if a node doesn't contain any GPU then it shouldn't show up on the table

@Michaelvll
Copy link
Collaborator

Michaelvll commented May 3, 2025

This PR also seems causing a backward compatibility issue due to the return value of the request from API server changes.

_get_kubernetes_realtime_gpu_table
    gpu_availability = models.RealtimeGpuAvailability(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: RealtimeGpuAvailability.__new__() missing 2 required positional arguments: 'capacity' and 'available'

@Michaelvll
Copy link
Collaborator

Another UX feedback:

It seems currently we show:

Total table
Context 1 GPUs
Context 1 Nodes

Context 2 GPUs
Context 2 Nodes
...

For SkyPilot users, GPUs are much more important than node information. A better way to show this is to show:

Total table
Context 1 GPUs
Context 2 GPUs

# We can even combine the context 1 and context 2 nodes in a single table
Context 1 Nodes
Context 2 Nodes
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[UX] sky show-gpus only show GPUs in one k8s cluster instead of all specified in allowed_contexts
4 participants